During 2020-2021, many hockey leagues (including NHL feeder leagues) had shortened seasons or no season at all because of restrictions on play caused by the COVID-19 pandemic. Some players affected by these restrictions played in other leagues or tournaments during the 2020-2021 season. Others did not play any league or tournament games that season. This raises the question of whether not playing games during the 2020-2021 COVID season hurt player development (or caused players to get worse). To answer this question, we examine data from the Ontario Hockey League (OHL), which did not play any games during the 2020-2021 season. Some players from this league chose to play in other leagues or tournaments, while others did not.
Filtering:
2019-2020 (“pre-COVID”) and 2021-2022 (“post-COVID”) seasons
league == OHL
Only players who played in the OHL during both pre- and post-COVID seasons
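The filtering steps above can be sketched in base R. This is a minimal illustration on made-up rows, not the project's actual code: the data frame `stats` and the column names `player_id`, `season`, and `league` are assumptions.

```r
# Hypothetical toy data; column names (player_id, season, league) are assumptions.
stats <- data.frame(
  player_id = c(1, 1, 2, 2, 3, 3, 4),
  season    = c("2019-2020", "2021-2022", "2019-2020", "2021-2022",
                "2019-2020", "2021-2022", "2021-2022"),
  league    = c("OHL", "OHL", "OHL", "AHL", "OHL", "OHL", "OHL"),
  stringsAsFactors = FALSE
)

keep_seasons <- c("2019-2020", "2021-2022")

# Keep OHL rows from the pre- and post-COVID seasons only.
ohl <- stats[stats$league == "OHL" & stats$season %in% keep_seasons, ]

# Count how many distinct seasons each player appears in.
seasons_per_player <- tapply(ohl$season, ohl$player_id,
                             function(s) length(unique(s)))

# Keep only players who appear in both seasons.
both_ids <- names(seasons_per_player)[seasons_per_player == 2]
ohl_both <- ohl[as.character(ohl$player_id) %in% both_ids, ]
```

In this toy example, players 2 and 4 drop out because each has OHL rows in only one of the two seasons.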
Variables added:
points per game per season (combined if a player played for multiple teams in a season)
games played per season (combined if a player played for multiple teams in a season)
treatment (i.e. whether a player played more than 10 games during the COVID season)
age (approximately the oldest a player was in a given season)
player quality (approximated by PPG in the pre-COVID season)
whether a player was drafted (draft data may not be fully up to date)
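As a rough sketch of how the per-season totals and the treatment flag might be derived, the example below combines rows for players who appeared for more than one team and then applies the 10-game threshold. All data values and column names (`gp`, `pts`, `covid_gp`) are hypothetical.

```r
# Hypothetical per-team rows; all column names here are assumptions.
rows <- data.frame(
  player_id = c(1, 1, 2, 3),
  season    = c("2019-2020", "2019-2020", "2019-2020", "2019-2020"),
  gp        = c(20, 30, 60, 10),
  pts       = c(10, 20, 30, 15)
)

# Sum games and points over every team a player appeared for in a season.
totals <- aggregate(cbind(gp, pts) ~ player_id + season, data = rows, FUN = sum)

# Points per game from the combined totals (not an average of per-team rates).
totals$ppg <- totals$pts / totals$gp

# Hypothetical games played during the 2020-2021 COVID season, in any league.
covid_gp <- data.frame(player_id = c(1, 2, 3), covid_gp = c(0, 25, 10))

# Treatment: TRUE when a player played more than 10 games during that season.
totals <- merge(totals, covid_gp, by = "player_id")
totals$treatment <- totals$covid_gp > 10
```

Computing PPG from summed points and games weights each team stint by the number of games played, which avoids overweighting a short stint with an unusually high or low scoring rate.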
Jitter plot / strip plot
Violin
Beeswarm
Density plot with rugs
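Two of the plot types listed above can be drawn with base R graphics, as in the sketch below on simulated data (the column names `ppg` and `treatment` are assumptions). Violin and beeswarm plots generally need add-on packages, for example `geom_violin()` in ggplot2 or `geom_beeswarm()` in ggbeeswarm, so they are not shown here.

```r
# Simulated stand-in data; `ppg` and `treatment` are assumed column names.
set.seed(42)
d <- data.frame(
  treatment = rep(c("played", "did not play"), each = 40),
  ppg       = c(rgamma(40, shape = 2, rate = 3), rgamma(40, shape = 2, rate = 3.5))
)

# Jitter / strip plot: one point per player, spread sideways to reduce overlap.
stripchart(ppg ~ treatment, data = d, method = "jitter", jitter = 0.2,
           vertical = TRUE, pch = 16, ylab = "Points per game")

# Density plot with a rug marking each individual observation.
dens <- density(d$ppg)
plot(dens, main = "PPG density", xlab = "Points per game")
rug(d$ppg)
```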
But is this difference “real”? That is, did not playing during the COVID season cause players to get worse at hockey, or can the difference be explained by confounding variables?
Check whether the explanatory variables are correlated with each other and with the response.
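One simple way to run this check is a correlation matrix over the numeric variables, optionally paired with a scatterplot matrix. The sketch below uses simulated data; the variables `age`, `gp`, and `ppg` are assumed names, and the simulated relationships say nothing about the real data.

```r
# Simulated stand-in data; variable names are assumptions.
set.seed(1)
n <- 100
players <- data.frame(
  age = sample(17:21, n, replace = TRUE),
  gp  = sample(20:68, n, replace = TRUE)
)
players$ppg <- 0.15 * (players$age - 17) + 0.005 * players$gp + rnorm(n, sd = 0.3)

# Pairwise Pearson correlations among the numeric variables.
cor_mat <- cor(players[, c("age", "gp", "ppg")])
print(round(cor_mat, 2))

# Scatterplot matrix for a visual check of the same relationships.
pairs(players[, c("age", "gp", "ppg")])
```

Strong correlation between two explanatory variables suggests they may be hard to separate in a regression model, while correlation with the response flags a candidate confounder to control for.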
Takeaways:
Players whose PPG is inflated by a low number of games played don’t seem to be a concern
There could be some kind of relationship here: PPG seems to increase with GP
Takeaways:
Likely because players get more skilled as they get older, and we are only including players who played both pre- and post-COVID
Something to control for in our model
Takeaways:
Forwards score more than defensemen, as expected.
But, this becomes problematic with the way we’re measuring player quality…
Problems with player quality and PPG
Takeaways:
Forwards are weighted as better players in our model because of our biased metric
Drafted vs. not drafted is another way to measure quality, but few players are drafted, and it is an all-or-nothing measure.
Takeaways:
What’s the distribution of age?
Does PPG increase as players get older?
Does player quality increase as players get older?
Were older players more likely to play during COVID (since older players are likely higher quality players)?
Takeaways:
Do older players play more games?
Did player quality influence whether someone played during COVID season?
Takeaways:
##
## Call:
## lm(formula = ppg ~ got_drafted, data = recent2)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -0.7204 -0.3837 -0.1204  0.2796  3.4210
##
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)
## (Intercept)     0.57897    0.01235  46.897  < 2e-16 ***
## got_draftedYes  0.14140    0.02837   4.984 6.75e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.5096 on 2100 degrees of freedom
## (1 observation deleted due to missingness)
## Multiple R-squared: 0.01169, Adjusted R-squared: 0.01122
## F-statistic: 24.84 on 1 and 2100 DF, p-value: 6.748e-07
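To read the output above: with a single two-level predictor, the intercept is the mean PPG of the reference group (undrafted, about 0.579) and the `got_draftedYes` coefficient is the difference in group means, so the drafted mean is about 0.579 + 0.141 = 0.720. The sketch below checks that relationship on simulated data; the values are made up and only the variable names mirror the output.

```r
# Simulated stand-in data; `ppg` and `got_drafted` mirror the names in the output.
set.seed(7)
toy <- data.frame(
  got_drafted = factor(rep(c("No", "Yes"), times = c(80, 20)))
)
toy$ppg <- ifelse(toy$got_drafted == "Yes", 0.72, 0.58) + rnorm(100, sd = 0.3)

fit <- lm(ppg ~ got_drafted, data = toy)
group_means <- tapply(toy$ppg, toy$got_drafted, mean)

# The intercept equals the mean PPG of the reference group ("No") ...
coef(fit)["(Intercept)"]
group_means["No"]

# ... and the got_draftedYes coefficient equals the difference in group means.
coef(fit)["got_draftedYes"]
group_means["Yes"] - group_means["No"]
```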
##
## Call:
## lm(formula = ppg ~ overall_pick_num, data = drafted)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -0.8521 -0.3705 -0.0767  0.3327  1.5990
##
## Coefficients:
##                    Estimate Std. Error t value Pr(>|t|)
## (Intercept)       0.9068616  0.0457513  19.822  < 2e-16 ***
## overall_pick_num -0.0019559  0.0004067  -4.809 2.16e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.4843 on 396 degrees of freedom
## Multiple R-squared: 0.05518, Adjusted R-squared: 0.0528
## F-statistic: 23.13 on 1 and 396 DF, p-value: 2.157e-06
The coefficient for ‘overall_pick_num’ is small (about -0.002 PPG per pick), but we can’t rely on the standard errors and p-values because the model does not meet the conditions for inference.
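The conditions for inference in linear regression are commonly checked with residual diagnostics. The sketch below shows two standard plots on simulated data with deliberately skewed noise; it does not reproduce the real data, and the source does not say which condition failed. Only the names `drafted`, `ppg`, and `overall_pick_num` are taken from the model call; everything else is an assumption.

```r
# Simulated stand-in data with right-skewed noise; names mirror the model call.
set.seed(3)
drafted <- data.frame(overall_pick_num = sample(1:217, 150, replace = TRUE))
drafted$ppg <- 0.9 - 0.002 * drafted$overall_pick_num + (rexp(150, rate = 2) - 0.5)

fit <- lm(ppg ~ overall_pick_num, data = drafted)

# Residuals vs. fitted: look for curvature (linearity) or a funnel (unequal variance).
plot(fitted(fit), resid(fit), xlab = "Fitted PPG", ylab = "Residual")
abline(h = 0, lty = 2)

# Normal Q-Q plot: points bending away from the line suggest non-normal residuals.
qqnorm(resid(fit))
qqline(resid(fit))
```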
The jitter plot shows that some players drafted in the seventh round played about as many games as those drafted in earlier rounds.
Defensemen have fewer points and lower points per game than forwards. Players from different draft rounds are intermixed.
Points per game was slightly higher in the 2019-2020 season.
No…
No… we need a more complex model.
How do we tell if someone is a first liner?
How do we know if it is a tournament or league (since they are the same variable)?
Should we be looking at points (pts) or goals (g)?
PPG is likely not adequately measuring player performance (especially for defensemen)
A multiple regression model accounting for all interactions between the explanatory variables does not meet the conditions for inference.
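For reference, a model with all interactions can be written in R's formula syntax with `*`. The sketch below is an illustration on simulated data only: the source does not list which explanatory variables the model used, so `treatment`, `age`, and `position` are assumptions drawn from the variables discussed earlier.

```r
# Simulated stand-in data; the variables in the actual model are not stated,
# so treatment, age, and position are assumptions.
set.seed(11)
n <- 200
sim <- data.frame(
  treatment = factor(sample(c("No", "Yes"), n, replace = TRUE)),
  age       = sample(17:21, n, replace = TRUE),
  position  = factor(sample(c("D", "F"), n, replace = TRUE))
)
sim$ppg <- 0.3 + 0.1 * (sim$age - 17) + 0.3 * (sim$position == "F") +
  rnorm(n, sd = 0.3)

# `*` expands to all main effects plus every two- and three-way interaction.
full <- lm(ppg ~ treatment * age * position, data = sim)

# A simpler additive model for comparison.
additive <- lm(ppg ~ treatment + age + position, data = sim)

# F-test of whether the interaction terms improve the fit.
anova(additive, full)
```

Comparing the full model against an additive one is one way to judge whether the extra interaction terms are worth their added complexity.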
Next steps for building models/maybe starting causal inference? …